Abstract



Given the output of a data source taking values in a finite alphabet, we wish 
to detect change-points, that is times when the statistical properties of the source 
change. Motivated by ideas of match lengths in information theory, we introduce a 
novel non-parametric estimator which we call CRECHE (CRossings Enumeration
CHange Estimator). We present simulation evidence that this estimator performs 
well, both for simulated sources and for real data formed by concatenating text 
sources. For example, we show that we can accurately detect the point at which 
a source changes from a Markov chain to an IID source with the same station- 
ary distribution. Our estimator requires no assumptions about the form of the 
source distribution, and avoids the need to estimate its probabilities. Further, we 
establish consistency of the CRECHE estimator under a related toy model, by 
establishing a fluid limit and using martingale arguments. 
1 Introduction and notation



Suppose we are given the output of a data source, in the form of a string x of n symbols 
drawn from a finite alphabet A, but have no knowledge of the source's statistical prop- 
erties. It is a well-studied problem to consider whether the source is stationary or, if it 
is piecewise stationary, to estimate the change-points - that is, positions at which the 
source model changes. In Section 2 we review existing approaches to the change-point
detection problem and describe some applications. 

This paper offers a new universal non-parametric perspective, motivated by ideas from 
information theory. Specifically, a substantial existing literature considers so-called
'match lengths'. That is, as described in Definition 3.1, for each point i we can define
the match length to be the length of the shortest substring starting at i which
does not occur elsewhere in the string. For a wide class of processes, consistent entropy
estimators can be constructed from the match lengths, as described in Section 3; see for
example [15, Theorem 1].

Our approach is motivated by the idea of considering match positions $T_i^n$, chosen
uniformly at random from the places where a substring of maximal match length occurs.
We consider creating a directed graph where position i is linked to $T_i^n$ defined in this
way. We refer to this as Graph Model A; see Definition 4.2 for a formal definition.



Heuristically, in a model with no change-points we believe that the $T_i^n$ will be
approximately uniformly distributed, and in a model with change-points the $T_i^n$ will tend to lie
in the same region as i. We therefore define the crossings functions $C_{LR}(j)$ and $C_{RL}(j)$
as follows:

Definition 1.1. For any directed graph formed by linking i to $T_i^n$, given a putative
change-point $0 \le j \le n-1$ we write

$C_{LR}(j) = \#\{k : k < j \le T_k^n\}$ for the number of left-right crossings of j, (1)

$C_{RL}(j) = \#\{k : T_k^n < j \le k\}$ for the number of right-left crossings of j. (2)

In a model with a single change-point at $n\gamma$, we look to estimate $\gamma$. We use normalized
versions of $C_{LR}(j)$ and $C_{RL}(j)$ to define an estimator $\hat{\gamma}$ of the change ratio.

Definition 1.2. For any sequence of $T_i^n$, using the definitions of $C_{LR}(j)$ and $C_{RL}(j)$
from Definition 1.1, define the normalized crossing processes

$\psi_{LR}(j) = \frac{C_{LR}(j)}{n-j} - \frac{j}{n}$ and $\psi_{RL}(j) = \frac{C_{RL}(j)}{j} - \frac{n-j}{n}$, (3)

the maximum function

$\psi(j) = \max\left( \psi_{LR}(j), \psi_{RL}(j) \right)$ (4)

and estimate the change-point using the CRECHE (CRossings Enumeration CHange
Estimator) as

$\hat{\gamma} = \frac{1}{n} \operatorname{argmin}_{0 \le j \le n-1} \psi(j)$. (5)

The process $\psi_{LR}(j)$ has been designed via subtracting off the mean of $C_{LR}(j)$ (in a model
with no change-point), and is related to the conductance of the directed graph.
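As a minimal sketch of how the crossings counts and the estimator could be computed from a given list of match positions, the following Python uses the conventions $C_{LR}(j) = \#\{k : k < j \le T_k\}$, $C_{RL}(j) = \#\{k : T_k < j \le k\}$, $\psi_{LR}(j) = C_{LR}(j)/(n-j) - j/n$ and $\psi_{RL}(j) = C_{RL}(j)/j - (n-j)/n$. The function names `crossings` and `creche` are illustrative, and skipping j = 0 (where $\psi_{RL}$ would divide by zero) is an assumption of this sketch rather than something stated in the text.

```python
from itertools import accumulate


def crossings(T):
    """Return (C_LR, C_RL) as lists indexed by j = 0, ..., n-1.

    T[k] is the position that k is linked to. The edge k -> T[k] is a
    left-right crossing of every j with k < j <= T[k], and a right-left
    crossing of every j with T[k] < j <= k.
    """
    n = len(T)
    d_lr = [0] * (n + 1)  # difference arrays, so the whole pass is O(n)
    d_rl = [0] * (n + 1)
    for k, t in enumerate(T):
        if t > k:            # contributes to C_LR(j) for j = k+1, ..., t
            d_lr[k + 1] += 1
            d_lr[t + 1] -= 1
        elif t < k:          # contributes to C_RL(j) for j = t+1, ..., k
            d_rl[t + 1] += 1
            d_rl[k + 1] -= 1
    return list(accumulate(d_lr[:n])), list(accumulate(d_rl[:n]))


def creche(T):
    """Estimate the change ratio as argmin_j psi(j) / n (j = 0 is skipped)."""
    n = len(T)
    c_lr, c_rl = crossings(T)

    def psi(j):
        psi_lr = c_lr[j] / (n - j) - j / n
        psi_rl = c_rl[j] / j - (n - j) / n
        return max(psi_lr, psi_rl)

    j_hat = min(range(1, n), key=psi)
    return j_hat / n
```

For the toy input `T = [1, 0, 3, 2]`, where positions 0 and 1 link only to each other and likewise 2 and 3, no edge crosses j = 2, and `creche(T)` returns 0.5.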



In Section 5 we prove that the CRECHE $\hat{\gamma}$ is $\sqrt{n}$-consistent in a related toy model, which
heuristically captures the key features of the piecewise stationary model. We consider
sampling $T_i^n$ from certain mixtures of uniform distributions (Graph Model B) and prove
the following theorem:

Theorem 1.3. For random variables $T_i^n$ generated according to Graph Model B (see
Definition 5.1), the estimator $\hat{\gamma}$ of Definition 1.2 is $\sqrt{n}$-consistent. That is, there exists
a constant K, depending on $\alpha_L$, $\alpha_R$ and $\gamma$, such that for all s:

$P\left( |\hat{\gamma} - \gamma| \ge \frac{s}{\sqrt{n}} \right) \le \frac{K}{s^2}$. (6)

Proof. See Appendix A. □

In Section 6 we present simulation evidence that this estimator $\hat{\gamma}$, applied to Graph
Model A, performs well in situations where the source is piecewise stationary. As Figure
4 shows, our algorithm can even distinguish between the output of a first order Markov
chain with stationary distribution $\mu$ and an IID process with the same distribution. Since
most non-parametric methods are based on monitoring means or densities of symbols (see
Section 2), this illustrates a major advantage of our techniques, since we can efficiently
partition texts that a density-based method would find indistinguishable. We hope that
we could even distinguish higher order Markov sources, in a situation where crude bigram
or trigram counts would similarly fail (or require prohibitive amounts of data).

Our method even appears to give good results in situations with a change-point between 
non-stationary sources - as illustrated in Figures [5] and [6] by examples based on written 
language. This robustness to changes in the source model should not be a surprise since 
the theory of match lengths described in Section 3 holds for a range of independent,
Markov and mixing sources. 

Further, we compare the two cases where $T_i^n$ are defined according to Graph Model A,
as in Definition 4.2, and Graph Model B, as in Definition 5.1. We present simulation
evidence that in these two cases the functions $\psi_{LR}$ and $\psi_{RL}$ have similar behaviour, and
hence the estimator $\hat{\gamma}$ performs similarly for Graph Model A and Graph Model B.






2 Change-point literature review 



The problem of detecting change-points is an important and well-studied one, with
applications in a range of fields listed in the book by Poor and Hadjiliadis [40, p. 1]. For
example, we mention bioinformatics [11], finance [2], sensor networks [36], climate [8],
analysis of writing style [121 US] computer security [31] and medicine [19] . Our ap- 
proach currently works in the case of finite alphabet sources, and is thus naturally suited 
to applications in bioinformatics, computer network intrusion detection and analysis of 
writing style. 

As reviewed for example in [29], many approaches to change-point detection exist
within a parametric framework. The general approach is to maximise the log-likelihood,
with a penalty term that ensures the number of changes is not too large. For example,
the binary segmentation algorithm of Scott and Knott [44] aims to detect changes in
mean of normal samples, an approach extended in work of Horvath [25] to detection
of changes of mean and variance. In general, as in [29], it is possible to model many
situations parametrically by supposing that between change-points, the data is IID from
a model with fixed parameter $\theta_i$, where the parameter $\theta_i$ is itself sampled from some
prior distribution. This parametric problem has the simplifying feature that versions of
the likelihood ratio test can be performed, and the work [29] concentrates on detection
of multiple change-points in as computationally efficient a manner as possible.

In contrast, non-parametric methods, required when the laws of the random variables
are not available, are less widely studied. The book by Brodsky and Darkhovsky [12] 
describes many such approaches, often based on detecting changes in the mean. Other 
non-parametric techniques include those based on ranks and order statistics [9], [23] , 
kernel-based methods [36] and approaches based on comparing empirical distribution 
functions before and after a putative change-point [H], [TH], [ID]- The paper ^21 extends 
this to consider the situation where the source is only observed indirectly or in the 
presence of noise. 

In particular, Ben Hariz, Wylie and Zhang [10] build on [18] to produce non-parametric
estimators which offer optimal n-consistency (error in $\hat{\gamma}$ of $O_P(1/n)$) under natural
assumptions. However, this approach is built on detecting changes in empirical
distributions, and so requires the stationary distributions either side of the change-point to be
different. In contrast (see Figure 4), our estimator can work well even in the case where
the stationary distributions are the same.

One further distinction to be drawn is whether the change-point is to be detected of- 
fline through a detailed analysis of the data sequence, or in real-time with streaming 
data. Results in the second (quickest detection) problem are extensively reviewed in the 
book by Poor and Hadjiliadis [40]. A range of objective and penalty functions can be 
considered, giving rise to Shiryaev's problem, Lorden's problem and others. In essence,
[40] shows that many such problems can be analysed using optimal stopping theory, and
algorithms based on versions of Page's CUSUM test can be shown to be optimal, as in 
the work of Pollak [39] and others. The current paper considers offline detection, but 
in future work we will describe an adaptation of our match position approach to the 
quickest detection problem, using match lengths as a proxy for log-likelihoods. 

Our approach to the problem of detection of a change of author or language, as illustrated
in Section 6, should be contrasted with the approach of Giron, Ginebra and Riba pTjl^.
These authors choose particular features, such as distributions of word lengths or local
frequencies of known popular words, and apply standard change-point analysis to the
resulting counts. A similar analysis of the homogeneity of texts is reviewed in [12, pp.
169-178]. In contrast, our universal approach takes into account all features, by finding long
repeated word patterns, and detecting variations from uniformity in their appearance.



3 Match lengths and entropy estimation 

We use calculations based on match lengths as defined by Grassberger [2^ and adopt the
notation of Shields [46]. That is, we consider a string x taking values in a finite alphabet
A, which we may take to be $\{1, \ldots, |A|\}$ for simplicity. We write $x_m^n = (x_m, \ldots, x_n)$ for
a finite subsequence.

Definition 3.1. For a given string x, define the match length at i as

$L_i^n = L_i^n(x) = \min\left\{ L : x_i^{i+L-1} \ne x_j^{j+L-1} \text{ for all } 1 \le j \le n, j \ne i \right\}$. (7)

For a wide range of sources, it has been proved that these match lengths can be used
to consistently estimate the entropy of data source X. Grassberger [2l] introduced $L_i^n$,
and explained heuristically why the following result should be true:

Theorem 3.2 (Shields). If match lengths are calculated for an IID or mixing Markov
source X with entropy H,

$\lim_{n \to \infty} \frac{\sum_{i=1}^n L_i^n}{n \log n} = \frac{1}{H}$ (8)

almost surely.
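To make the match length and the resulting entropy estimate concrete, here is a naive sketch in Python. It takes the match length at i to be the smallest L such that the length-L block starting at i occurs nowhere else in the string, and estimates the entropy (in bits) by $n \log_2 n / \sum_i L_i^n$, the reciprocal of the ratio whose limit is 1/H. The function names, the 0-based indexing and the boundary convention (the search stops when the block would run past the end of the string) are assumptions of this sketch, and the running time is far from optimal.

```python
import math


def match_lengths(x):
    """Naive match lengths for a string or tuple x, using 0-based positions."""
    n = len(x)
    lengths = []
    for i in range(n):
        L = 1
        # Grow L while the length-L block at i fits in the string and also
        # occurs at some other starting position j != i.
        while i + L <= n and any(
            x[j:j + L] == x[i:i + L] for j in range(n - L + 1) if j != i
        ):
            L += 1
        lengths.append(L)
    return lengths


def entropy_estimate(x):
    """Match-length entropy estimate in bits per symbol: n log2(n) / sum(L_i)."""
    n = len(x)
    return n * math.log2(n) / sum(match_lengths(x))
```

For example, `match_lengths("abab")` gives `[3, 2, 3, 2]`. Suffix-tree or suffix-array methods would be the natural choice for long inputs; the quadratic-or-worse loop here is only meant to mirror the definition.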



Theorem 3.2 is given as Theorem 1 of [15], though the proof was completed in 
Shields [Ul Section 3] shows that (8) does not hold in general, suggesting that determining
the class of processes for which convergence holds is a difficult problem. However,
further progress was made by Kontoyiannis and Suhov [M], who extended the
convergence to the class of stationary ergodic finite alphabet processes under a Doeblin
condition. In turn, Quas [H] extended this result to countable alphabets. 

Entropy estimators given by the left-hand side of (8) have the advantages of being
non-parametric, computationally efficient and with fast convergence in n. In particular,
they out-perform naive plug-in estimators which estimate probability mass functions p
by empirical estimators $\hat{p}$, and then use $H(\hat{p})$ to estimate the entropy (see [20] for a
detailed simulation analysis illustrating this).

We can heuristically understand why the result (8) might hold, using insights given
by the Asymptotic Equipartition Property for IID sources (see [15, Theorem 3.1.2]),
or Shannon-MacMillan-Breiman theorem for stationary ergodic sources (see [Ij). This 
latter result states that for a stationary ergodic finite alphabet source of entropy H, for
m large enough, there exists a 'typical set' $\mathcal{T}_m$ of strings of length m such that:

1. A random string lies in $\mathcal{T}_m$ with probability $\ge 1 - \epsilon$.

2. Any individual string in $\mathcal{T}_m$ has probability $\in [2^{-m(H+\epsilon)}, 2^{-m(H-\epsilon)}] \sim 2^{-mH}$.

Hence, if the substring of length m at point i is typical, that is $x_i^{i+m-1} \in \mathcal{T}_m$, it has
probability $\sim 2^{-mH}$, so we expect to see it $\sim n 2^{-mH}$ more times. This means that
choosing $m = (\log n)/H$, we expect to see $x_i^{i+m-1}$ once more, so match length $\sim (\log n)/H$.

However, it is a delicate matter to convert this intuition into a formal proof, since there
are complex dependencies between $L_i^n$ for distinct values of i. The proofs of results such



as Theorem 3.2 and its later extensions in |15], [M] and [H] typically involve arguments 



involving the return times Rk, based on theorems taken from Ornstein and Weiss [371 EH]- 
Definition 3.3. Define $R_k$ to be the time before the block $X_1^k$ is next seen:

$R_k = \min\{t \ge 1 : X_1^k = X_{t+1}^{t+k}\}$. (9)



It is possible to directly estimate entropy using the return time. Kac's Lemma [28]
shows that $E[R_k | X_1^k = x_1^k] = 1/P(X_1^k = x_1^k)$, for stationary ergodic X. This intuition
was developed by Kim [30], who proved that $E[\log R_k] - kH$ converges to a constant for
independent processes and by Wyner (see [51], [52]), who proved asymptotic normality of
$(\log R_k - kH)/\sqrt{k}$ under the same conditions. Corollary 2 of Kontoyiannis [33] extended
this to general stationary X satisfying mixing conditions.

A simpler problem to analyse is one where the output of the source is parsed (partitioned) 
into non-overlapping blocks, and the matches take place by a blockwise comparison (this 
means that 'overlapping matches' are avoided). For example, the Lempel-Ziv parsing 
[53] breaks the source down into consecutive blocks formed as 'the shortest block
not yet seen'. In this case, as described in Cover and Thomas [15], a natural question
with applications to many data compression algorithms is to understand the asymptotic
behaviour of $L_m$, the total length of the first m codewords. Aldous and Shields [3] proved
asymptotic normality of $L_m$ for IID equidistributed binary processes, a result extended
by Jacquet and Szpankowski [26] to IID asymmetric binary processes.

An even simpler matching was introduced by Maurer [33]. In this case, the output of
the source is partitioned into blocks of fixed length $\ell$, and matchings sought between
them. That is, we can define block random variables $Z_i = X_{(i-1)\ell+1}^{i\ell} \in A^\ell$, and see how
long each block takes to reappear.

Definition 3.4. For any j, define random variable

$S_j = \min\{t \ge 1 : Z_{j+t} = Z_j\}$, (10)
to be the return time of the jth block.

Maurer [33] proved that $\log S_i/\ell$ converges to the entropy H if the source is IID binary,
with a similar result proved for stationary $\psi$-mixing processes by Abadi and Galves in
[1]. Johnson [27] proved a Central Limit Theorem for the average of $\log S_j$, and hence
consistency of the resulting entropy estimates.



4 Sources with change-points and match positions 

As described in Section 3, previous work on match lengths has typically considered the
case of a stationary or ergodic source process; that is, one with constant distribution
over time. Next we extend this to a model with change-points. We consider the string x
to be generated by the concatenation of two source processes $\mu_1$ and $\mu_2$, with a sample
of length $n\gamma$ and $n(1-\gamma)$ of each. (This parameterization is the same as that used by
[18] and [10].)

Definition 4.1. Sample two independent infinite sequences x(1), x(2), where
$x(i) = x(i)_0^\infty \sim \mu_i$ for i = 1, 2. Given length parameter n and change-point ratio $\gamma$, define the
concatenated process x by

$x_i = \begin{cases} x(1)_i & \text{if } 0 \le i \le n\gamma - 1, \\ x(2)_i & \text{if } n\gamma \le i \le n - 1. \end{cases}$ (11)

There has been some work concerning the properties of such a concatenated source, 
though this has focussed on the case where $\gamma$ is known. Arratia and Waterman [HI [7]
consider the longest common subsequence between the x(1) and x(2) processes; in contrast,
in some sense we consider average common subsequences. The papers of Cai, Kulkarni
and Verdú [13] and of Ziv and Merhav [S^ both consider the problem of estimating the
relative entropy from one source to another. The first paper uses algorithms based
on the Burrows-Wheeler transform and Context Tree Weightings; the second ^SSj defines
empirical quantities which converge to the relative entropy. However, such analysis does
not directly help us in the setting where $\gamma$ is unknown.

We now define the match positions $T_i^n$ generated by Graph Model A:

Definition 4.2 (Graph Model A). Taking match lengths $L_i^n$ as introduced in Definition 3.1,
write $S_i^n$ for the positions of the match at i

$S_i^n = \left\{ j : x_j^{j+L_i^n-2} = x_i^{i+L_i^n-2}, 1 \le j \le n, j \ne i \right\}$ (12)
and take $T_i^n$ chosen uniformly and independently at random among the elements of $S_i^n$.
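The construction of match positions can be sketched directly in Python: for each position i, find the longest block starting at i that also occurs at some other position, collect all other positions where that block occurs, and pick one uniformly at random. This is only an illustrative, unoptimised sketch; the function name `match_positions`, the 0-based indexing, the `rng` argument, and the treatment of blocks that run into the end of the string are assumptions rather than details taken from the text.

```python
import random


def match_positions(x, rng=random):
    """For each i, pick uniformly among positions j != i sharing the longest
    repeated block starting at i (a sketch of Graph Model A). Needs len(x) >= 2.
    """
    n = len(x)
    T = []
    for i in range(n):
        # Invariant: every j in candidates satisfies x[j:j+m] == x[i:i+m].
        candidates = [j for j in range(n) if j != i]
        m = 0
        while i + m < n:
            extended = [j for j in candidates if j + m < n and x[j + m] == x[i + m]]
            if not extended:
                break  # the block of length m + 1 occurs nowhere else
            candidates = extended
            m += 1
        T.append(rng.choice(candidates))
    return T
```

For `x = "abcabc"` every position has exactly one maximal match, so the output is deterministic: `[3, 4, 5, 0, 1, 2]`. Passing a seeded `random.Random` instance as `rng` makes runs on other inputs reproducible.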

Given a realisation of x, recall that we hope to detect the change-point - that is, to
estimate the true value of $\gamma$. The idea is that substrings of x(1) are likely to be similar
to other substrings of x(1) (and similarly for x(2)). Hence we expect that if $i \le n\gamma - 1$
then $T_i^n$ will tend to be $\le n\gamma - 1$ as well. Similarly, for $i \ge n\gamma$, we expect that $T_i^n$ will
tend to be $\ge n\gamma$. We consider constructing a directed graph, with an edge between each
i and the corresponding $T_i^n$, and define the crossings processes $C_{LR}(j)$ and $C_{RL}(j)$ as in
Definition 1.1.



We will look to find j such that $C_{LR}(j)$ and $C_{RL}(j)$ are small. However, consider j = 1;
then $C_{LR}(1) = 1$, and $C_{RL}(1)$ will be expected to be close to 1. This suggests that instead
of simply minimising $C_{LR}(j)$ and $C_{RL}(j)$ over j, we should consider a normalized version
of these quantities. The exact form of Definition 1.2 is motivated by the martingale
arguments used in Appendix A below.

We give theoretical and simulation results which address how close $\hat{\gamma}$ and $\gamma$ are. We do
not expect to be able to find the change-point exactly, but hope to prove a consistency
result. We expect that as n gets larger, the problem will get easier, though this will
be controlled by certain parameters, such as the entropy rates $H(\mu_1)$ and $H(\mu_2)$ and
relative entropy rates $D(\mu_1 \| \mu_2)$ and $D(\mu_2 \| \mu_1)$.



5 Consistency of $\hat{\gamma}$ for toy source model

The theoretical analysis of $\hat{\gamma}$ under Graph Model A is a complex problem. However,
we prove consistency of $\hat{\gamma}$ in a related scenario, where $T_i^n$ are generated as mixtures of
uniform distributions, which we refer to as Graph Model B, as follows:
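Definition 5.1 (Graph Model B). Given parameters $0 \le \alpha_L \le 1$ and $0 \le \alpha_R \le 1$,
write $\delta_L = (\gamma + (1-\gamma)\alpha_L)$ and $\delta_R = (\gamma \alpha_R + (1-\gamma))$. Define independent random
variables $T_i^n$ such that:

1. for each $0 \le i \le n\gamma - 1$, $P(T_i^n = j) = \begin{cases} \frac{1}{n\delta_L} & 0 \le j \le n\gamma - 1, \\ \frac{\alpha_L}{n\delta_L} & n\gamma \le j \le n - 1. \end{cases}$

2. for each $n\gamma \le i < n$, $P(T_i^n = j) = \begin{cases} \frac{\alpha_R}{n\delta_R} & 0 \le j \le n\gamma - 1, \\ \frac{1}{n\delta_R} & n\gamma \le j \le n - 1. \end{cases}$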
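A sampler for this toy model can be sketched in a few lines of Python. The sketch assumes the mixture-of-uniforms reading of the definition: a position i to the left of the change-point puts weight proportional to 1 on each left position and weight proportional to $\alpha_L$ on each right position, and a position to the right puts weight proportional to $\alpha_R$ on each left position and 1 on each right position (the normalising constants are then $n\delta_L$ and $n\delta_R$). The function name, the rounding of $n\gamma$ to an integer, and the `rng` argument are illustrative assumptions.

```python
import random


def sample_graph_model_b(n, gamma, alpha_l, alpha_r, rng=random):
    """Draw T_0, ..., T_{n-1} independently from mixtures of two uniforms.

    Assumes 0 < round(n * gamma) < n, so both blocks are non-empty.
    """
    m = round(n * gamma)  # size of the left block {0, ..., m-1}
    T = []
    for i in range(n):
        if i < m:
            w_left, w_right = 1.0, alpha_l   # relative weight per position
        else:
            w_left, w_right = alpha_r, 1.0
        # Probability that T_i falls in the left block.
        p_left = w_left * m / (w_left * m + w_right * (n - m))
        if rng.random() < p_left:
            T.append(rng.randrange(0, m))    # uniform on the left block
        else:
            T.append(rng.randrange(m, n))    # uniform on the right block
    return T
```

With `alpha_l = alpha_r = 0` no edge ever crosses the change-point, while `alpha_l = alpha_r = 1` recovers the null model in which every $T_i^n$ is uniform on {0, ..., n-1}.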



Theorem 1.3 proves that $\hat{\gamma}$ is consistent in this case. The proof of Theorem 1.3 is
built on a series of results, and described in Appendix A. First, in Appendix A.1, we
understand the behaviour of the crossings processes in a situation with no change-point.
This establishes the martingale tools we will use and allows us to prove a fluid limit, as
described in for example [15]. That is, we show that in a model with no change-point
the normalized crossings process $\psi_{LR}$ is a martingale, and use Doob's submartingale
inequality to control the deviation of the crossing process from its mean.



In Appendix A.2 we consider models with a change-point. We develop the previous
argument to prove that again in this case functions related to $\psi_{LR}$ are martingales, and
hence control their difference from their mean. We use this to deduce where the crossing
function will be minimised, and complete the proof of consistency of $\hat{\gamma}$.

Note that in order to prove consistency of $\hat{\gamma}$, it is not enough to control the marginal
distributions of $\psi_{LR}(j)$ and $\psi_{RL}(j)$; we need uniform control of the crossings processes.



Although our proof of Theorem 1.3 is based on Doob's submartingale inequality, we 
briefly mention that it is possible to gain an understanding of the crossings process in 
terms of empirical process theory. The link between these two methods is perhaps not 
a surprise, since similar relationships have been used for example by Wellner [19] . 

Recall that, given independent $U_i \sim U[0,1]$, then writing the empirical distribution
function $F_n(x) = \frac{1}{n} \sum_{i=1}^n I(U_i \le x)$ and $D_n = \sup_x |F_n(x) - x|$, Kolmogorov [32, Theorem 1]
proved that $\sqrt{n} D_n$ converges in law to the so-called Kolmogorov distribution.
This result can be understood in the context of Donsker's Theorem, which states that
$\sqrt{n}(F_n(x) - x)$ converges in distribution to a Brownian bridge B(x) (see for example
Theorem 3.3.1, p. 110]). The fact that the supremum of |B(x)| has the Kolmogorov
distribution can be proved using the reflection principle; see for example [TTJ Proposition
12.3.4].



We can use related ideas to describe the crossings process $\psi_{LR}$ of Definition 1.2 in the
sense of finite dimensional distributions, in the context of the model without change-points
used in Appendix A.1.
Lemma 5.2. For each $0 \le i \le n-1$, define $T_i^n$ independently uniformly distributed on
{0, ..., n-1}. The process $\sqrt{n}\, \psi_{LR}(\alpha n) \to \sqrt{\alpha}\, W(\alpha/(1-\alpha))$, in the sense of finite
dimensional distributions. In particular, for fixed $\alpha$ the $\sqrt{n}\, \psi_{LR}(\alpha n) \to N\left(0, \frac{\alpha^2}{1-\alpha}\right)$.

However, in order to prove consistency of $\hat{\gamma}$ we require uniform control of the crossings
process, meaning that martingale tools are natural in this context.



6 Simulation results 



We illustrate by simulation results how the function $\psi$ of Definition 1.2 behaves when
$T_i^n$ are defined by match lengths, as in Graph Model A of Definition 4.2. Note that
since $0 \le C_{LR}(j) \le j$ and $0 \le C_{RL}(j) \le n - j$, we know that

$-\frac{j}{n} \le \psi_{LR}(j) \le \frac{j^2}{n(n-j)}$ and $-\frac{n-j}{n} \le \psi_{RL}(j) \le \frac{(n-j)^2}{nj}$.

See Figure 1 for a schematic illustration of the envelopes of these functions.

As Figure 1 might suggest, the function $\psi(j)$ can take large positive values for j close
to 0 or n. However, since we are looking for the minimum value of $\psi$, this does not
affect the analysis. In Figure 2 we illustrate how $\psi(j)$ behaves in a null model with no
change-point. Observe that $\psi(j)$ remains close to zero except at the end points, where
it can take large positive values, as we would hope.



Figure 1: Schematic diagram of bounds on $\psi_{LR}$, $\psi_{RL}$ and $\psi$. Red curves bound values
of $\psi_{LR}$, green curves bound $\psi_{RL}$, shaded region is envelope of possible values of $\psi$.

Figure 2: Values of $\psi(j)$ simulated from Graph Model A with a source with no change-point.

In Figure 3 we plot values of $\psi(j)$ in a model formed by concatenating two IID sources
in the sense of Definition 4.1. The change-point is marked by a vertical red line, and the
function is minimised very close to this point, as we would hope. Further, in Figure
3 the form of the process observed fits closely with the theoretical properties of the
corresponding process for $T_i^n$ generated by a toy model as in Section 5. Specifically,
the function $\psi(j)$ remains close to a piecewise smooth function, except close to the ends
of the interval. Further, the piecewise smooth function is made up of three components:
a concave function, a linear part, and another concave function. We explain how this
pattern might be expected in Remark A.6 below.



We illustrate in Figure 4 how the algorithm performs over repeated trials simulated
under Graph Model A. The histogram illustrates that the algorithm generally performs
well, with a defined peak in estimates $\hat{\gamma}$ close to the true value $\gamma$. In particular,
Figure 4 represents a solution to a difficult problem, in that it shows that our algorithm
can efficiently partition a concatenation of a Markov chain with transition matrix

$\begin{pmatrix} 0.1 & 0.5 & 0.4 \\ 0.3 & 0.4 & 0.3 \\ 0.5 & 0.3 & 0.2 \end{pmatrix}$

with stationary distribution (0.3, 0.4, 0.3) and an IID source with
distribution (0.3, 0.4, 0.3). Methods based on crude symbol counts would fail here, but
the algorithm essentially 'discovers' non-uniformity in the digram counts. The skewness
of the histogram is perhaps to be expected, given the fact that Equations (33) and (35)
below are not equal (these Equations bound the performance of the related toy Graph
Model B).
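The claim that the two sources share single-symbol frequencies but differ in their digram frequencies can be checked numerically. The short Python sketch below (variable and function names are illustrative) verifies that (0.3, 0.4, 0.3) is stationary for the transition matrix above, and compares the digram probabilities $\pi_a P_{ab}$ of the Markov chain with the products $\pi_a \pi_b$ of the IID source.

```python
P = [
    [0.1, 0.5, 0.4],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],
]
pi = [0.3, 0.4, 0.3]


def step(dist, P):
    """One step of the chain: returns the row vector dist * P."""
    k = len(dist)
    return [sum(dist[a] * P[a][b] for a in range(k)) for b in range(k)]


# Stationarity: pi * P should reproduce pi (up to floating-point rounding).
pi_next = step(pi, P)
is_stationary = all(abs(u - v) < 1e-12 for u, v in zip(pi_next, pi))

# Digram probabilities under each source.
markov_digrams = [[pi[a] * P[a][b] for b in range(3)] for a in range(3)]
iid_digrams = [[pi[a] * pi[b] for b in range(3)] for a in range(3)]
```

For instance, the digram (0, 0) has probability 0.3 x 0.1 = 0.03 under the Markov chain but 0.3 x 0.3 = 0.09 under the IID source, which is the kind of second-order difference that symbol counts alone cannot see.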

Figure 3: Values of $\psi(j)$ simulated from Graph Model A with a source with a change-point
at a position marked by a vertical line. The source is generated by concatenating
10,000 symbols drawn IID from the distribution (0.1, 0.3, 0.6) with 40,000 symbols drawn
IID from the distribution (0.5, 0.25, 0.25).

Even when the two sources are not stationary, our estimator $\hat{\gamma}$ appears to detect the
change-point accurately. That is, Figures 5 and 6 illustrate that our estimator accurately
detects the change-point in models built up by concatenating natural language. In other
words, in both figures, the function $\psi(j)$ is minimised very close to the vertical line. The
source of Figure 5 is formed by concatenating German and English versions of Faust,
having sanitised the German text to remove umlauts, in order to make it look as English
as possible. Figure 6 depicts a switch between two English authors.

Note that the value of $\psi(j)$ is lower for Figure 5 than for Figure 6, illustrating the
natural idea that two English authors are harder to distinguish than two authors writing
in different languages. This fits with the simulation evidence provided in [13, Section V],
where different languages, and different authors writing in English, are distinguished by
relative entropy estimates. The authors suggest [13, Figures 15 and 17] that the relative
entropy from English to German and from German to English are both around 2.5-2.6,
whereas the relative entropy from one English author to another is typically around 0.3.
However, note that the paper [13] considers a different situation, in that they consider
a corpus of separate texts with authors already distinguished, whereas this paper shows
how to partition a text by authorship.
Figure 4: Values of $\hat{\gamma}$ based on repeated trials from Graph Model A with a source
with a change-point at $\gamma$, marked by a vertical line. In each case, we take n = 15,000,
and the source is generated by concatenating $n\gamma$ symbols drawn from a Markov chain
with stationary distribution (0.3, 0.4, 0.3), with $n(1-\gamma)$ symbols drawn IID from the
distribution (0.3, 0.4, 0.3). The first three figures represent (a) $\gamma$ = 1/3 (b) $\gamma$ = 1/2 (c)
$\gamma$ = 2/3. The fourth figure shows the empirical average of the curve $\psi$ for the different
values of $\gamma$. In each case, the plot is based on 1000 trials.
Figure 5: Values of $\psi(j)$ generated from Graph Model A with a source which switches
from German to English versions of Faust at the position marked by a vertical line.

Figure 6: Values of $\psi(j)$ generated from Graph Model A with a source which switches
between English authors at the position marked by a vertical line.
7 Discussion

In this paper we have introduced a new change-point estimator, based on ideas from
information theory. We have demonstrated that it works well for a variety of data
sources, and proved $\sqrt{n}$-consistency in a related toy problem. We believe that the
CRECHE $\hat{\gamma}$ can be adapted to detect change-points in a variety of related scenarios,
and point out some directions for future research.



1. First, we hope to prove consistency of $\hat{\gamma}$ under Graph Model A, by establishing a
version of Theorem 1.3. This is likely to require an analysis of return times similar
to those described in Section 3, taking into account the complicated dependencies
that exist between return times of distinct and overlapping substrings. However,
we regard Theorem 1.3 as a significant first step towards proving such a result,
since the simulation results presented in this paper suggest that the estimator
behaves similarly in both cases.

We note that, under Graph Model A, we expect the rate of convergence of $\hat{\gamma}$ to
$\gamma$ to be quicker than the $O_P(1/\sqrt{n})$ obtained in Theorem A.1, and perhaps even
comparable with the $O_P(1/n)$ obtained by [10]. This is because a joint version
of the Asymptotic Equipartition Property suggests that a typical string of length
$O(\log n)$ from $\mu_1$ will have $\mu_2$-probability decaying like $O(n^{-c})$ for a certain
constant c. This suggests that in terms of the toy model, we should consider crossing
probabilities $\alpha_L$ and $\alpha_R$ decaying to 0. Remark A.7 below shows that in the case
$\alpha_L = \alpha_R = 0$, much faster convergence is achieved in the toy model.

2. Second, we believe that these consistency results should extend to scenarios with 
multiple change-points (assuming the number of change-points is low compared to 
the length of the data stream). In this case, simulations show that $\psi(j)$ should
have several local minima, each corresponding to a change-point, but the analysis 
required to prove this is more involved. 

3. Third, we believe that estimators of CRECHE type can be extended to real-valued
data, as opposed to those coming from finite alphabets. In this setting, we should 
be able to construct a directed graph using closest matchings in Euclidean distance, 
motivated by ideas from rate-distortion theory. We can then use the crossings 
function in precisely the same way. 

4. Finally, in future work we will address the issue of quickest detection of change-points
in streaming data, in the spirit of [40]. By estimating the typical set during
the burn-in period, we believe that match lengths can act as a proxy for the
log-likelihood in the CUSUM test.
A Proof of Theorem 1.3

A.1 Matchings in an IID setting

First, we consider the behaviour of the crossings function in a simpler situation than
the Graph Model B of Definition 5.1, by considering a model without a change-point,
analogous to Figure 2. We obtain uniform control of the type required.

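Theorem A.1. For each $0 \le i \le n-1$ define $T_i^n$ independently uniformly distributed
on {0, ..., n-1}. For the normalized crossings process $\psi_{LR}(j)$ of Definition 1.2, for any
$0 < \alpha < 1$ and s > 0,

$P\left( \sup_{0 \le j \le (1-\alpha)n} |\psi_{LR}(j)| \ge \frac{s}{\sqrt{n}} \right) \le \frac{(1-\alpha)^2}{\alpha s^2}$, (13)

that is, $\left\{ |\psi_{LR}(j)| \le \frac{1-\alpha}{\sqrt{\alpha \epsilon n}} : 0 \le j \le (1-\alpha)n \right\}$ is a pathwise $(1-\epsilon)$ confidence region on
the process.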


The control of $|\psi_{LR}(j)|$ provided by Theorem A.1 is of optimal order, in the following
two senses:

Remark A.2.

1. We cannot improve the order (in n) of the uniform bound. By Lemma 5.2, the
$\sqrt{n}\, \psi_{LR}(n(1-\alpha)) \to N(0, (1-\alpha)^2/\alpha)$, so that

$\liminf_{n \to \infty} P\left( \sup_{0 \le j \le n(1-\alpha)} |\psi_{LR}(j)| \ge \frac{s}{\sqrt{n}} \right) \ge \liminf_{n \to \infty} P\left( |\psi_{LR}(n(1-\alpha))| \ge \frac{s}{\sqrt{n}} \right) = 2\left( 1 - \Phi\left( \frac{s\sqrt{\alpha}}{1-\alpha} \right) \right)$. (14)

2. We cannot expect to control $\psi_{LR}(j)$ uniformly in all $j \le n-1$ to the same order
of accuracy, as the widening envelope in Figure 1 might suggest. Specifically, since
$C_{LR}(n-1) \sim \mathrm{Bin}(n-1, 1/n) \to \mathrm{Po}(1)$, for any $\delta < 1$,

$\liminf_{n \to \infty} P\left( \sup_{0 \le j \le n-1} |\psi_{LR}(j)| \ge \delta \right) \ge \liminf_{n \to \infty} P(C_{LR}(n-1) = 0) = e^{-1}$. (15)
Remark A.2 helps to explain the large fluctuations in $\psi(j)$ seen in Figure 2. In this toy
model with no change-point: for $j \le n(1-\alpha)$, the maximal fluctuations of $\psi_{LR}(j)$ are
$O_P(1/\sqrt{n})$, but for $j \le n$, the maximal fluctuations are $O_P(1)$. Similarly, fluctuations in
$\psi_{RL}(j)$ will be $O_P(1/\sqrt{n})$ for j bounded away from zero, and $O_P(1)$ overall.

We first prove a technical lemma regarding the thinning operation introduced by Renyi 
[42j . That is, for each random variable Y, the a-thinned version (a) oY = YlJ=i -^i"^ 
where sf"^ are Bernoulli(a), independent of each other and of Y. This allows us to 
describe a process with binomial marginals which will prove useful for us. In the language 
of [S] this process is a (non-stationary) first-order integer- valued autoregressive INAR[1) 
process, a discrete equivalent of an AR(1) time series process. 

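Lemma A.3. For fixed $N$ and $\beta$, define a process $(Y_j)$ by $Y_0 = 0$, and recursively taking

$$Y_{j+1} = \left( \frac{N-j-1}{N-j} \right) \circ Y_j + U_j, \tag{16}$$

where $U_j \sim \mathrm{Bern}\left( \frac{\beta(N-j-1)}{N} \right)$ independently of all other random variables. Then,

1. For all $j$, $Y_j \sim \mathrm{Bin}(j, \beta(N-j)/N)$.

2. The process $Z_j = \frac{Y_j}{N-j} - \frac{\beta j}{N}$ is a martingale.

3. For any $d$, the process $W_j = \left(1 + \frac{d}{N-j}\right)^{Y_j} \Big/ \left(1 + \frac{d\beta}{N}\right)^{j}$ is a martingale.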



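Proof.

1. This result is true by definition for $j = 0$; we will prove it by induction in general. Recall that for any $\alpha$, $n$ and $p$, if $Y \sim \mathrm{Bin}(n, p)$ then $(\alpha) \circ Y \sim \mathrm{Bin}(n, \alpha p)$. Assuming $Y_j \sim \mathrm{Bin}(j, \beta(N-j)/N)$ for a particular $j$, then

$$Y_{j+1} \sim \left( \frac{N-j-1}{N-j} \right) \circ \mathrm{Bin}\left( j, \frac{\beta(N-j)}{N} \right) + \mathrm{Bern}\left( \frac{\beta(N-j-1)}{N} \right) \sim \mathrm{Bin}\left( j, \frac{\beta(N-j-1)}{N} \right) + \mathrm{Bern}\left( \frac{\beta(N-j-1)}{N} \right) \sim \mathrm{Bin}\left( j+1, \frac{\beta(N-j-1)}{N} \right).$$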
2. This means that $E Y_j = \mu_j := \beta j (N-j)/N$ for all $j$. As a result, since

$$E[Y_{j+1} \mid Y_j = m] = m\,\frac{N-j-1}{N-j} + \frac{\beta(N-j-1)}{N},$$

and since $Z_j = u$ exactly when $Y_j = \mu_j + u(N-j)$:

$$E[Z_{j+1} \mid Z_j = u] = E\left[ \frac{Y_{j+1}}{N-j-1} \,\Big|\, Y_j = \mu_j + u(N-j) \right] - \frac{\beta(j+1)}{N} = \left( \frac{\mu_j + u(N-j)}{N-j} + \frac{\beta}{N} \right) - \frac{\beta(j+1)}{N} = u,$$

by substituting for $\mu_j$.

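3. Write $\alpha_j = (N-j-1)/(N-j)$, $\beta_j = \beta(N-j-1)/N$, $\gamma_j = 1 + d/(N-j)$ and $L = 1 + d\beta/N$. By a similar argument, since $\gamma_j^{Y_j} = u$ when $Y_j = \log u / \log \gamma_j = m$ say, we know that

$$E\left[ \gamma_{j+1}^{Y_{j+1}} \,\Big|\, Y_j = m \right] = \left( \sum_{r=0}^{m} \binom{m}{r} \alpha_j^{r} (1-\alpha_j)^{m-r} \gamma_{j+1}^{r} \right) \left( \beta_j \gamma_{j+1} + 1 - \beta_j \right) = \gamma_j^{m} L = uL,$$

since $\alpha_j \gamma_{j+1} + 1 - \alpha_j = \gamma_j$ and $\beta_j \gamma_{j+1} + 1 - \beta_j = L$.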



□ 
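The recursion of Lemma A.3 is easy to simulate, which gives a quick sanity check on parts 1 and 2. The following is a minimal sketch under stated assumptions: the innovation is taken to be $U_j \sim \mathrm{Bern}(\beta(N-j-1)/N)$, the thinning is done unit by unit, and the function names and parameter values are ours, chosen only for illustration.

```python
import random


def simulate_y(N, beta, steps, rng):
    """Run Y_{j+1} = ((N-j-1)/(N-j)) o Y_j + U_j from Y_0 = 0; return Y_steps."""
    y = 0
    for j in range(steps):
        keep = (N - j - 1) / (N - j)
        # Thinning: each of the y existing units survives independently.
        y = sum(1 for _ in range(y) if rng.random() < keep)
        # Innovation U_j ~ Bern(beta * (N - j - 1) / N) (assumed form).
        if rng.random() < beta * (N - j - 1) / N:
            y += 1
    return y


def estimate_means(N, beta, steps, runs, seed=0):
    """Monte Carlo estimates of E[Y_j] and E[Z_j] at j = steps."""
    rng = random.Random(seed)
    finals = [simulate_y(N, beta, steps, rng) for _ in range(runs)]
    mean_y = sum(finals) / runs
    # Z_j = Y_j / (N - j) - beta * j / N should have mean zero.
    mean_z = sum(y / (N - steps) - beta * steps / N for y in finals) / runs
    return mean_y, mean_z


if __name__ == "__main__":
    N, beta, steps = 50, 0.8, 30
    mean_y, mean_z = estimate_means(N, beta, steps, runs=4000)
    print("empirical E[Y_j]:", mean_y)
    print("binomial mean   :", steps * beta * (N - steps) / N)
    print("empirical E[Z_j]:", mean_z)
```

With these values the binomial mean $j\beta(N-j)/N$ is 9.6, and the empirical mean of $Z_j$ should be close to its starting value $Z_0 = 0$.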



Proof of Theorem A.1. The key is to observe that for $T$ uniform on $\{0, \ldots, n-1\}$, $P(T = j \mid T \ge j) = P(T = j)/P(T \ge j) = 1/(n-j)$. This means that the LR crossing process $C_{LR}(j)$ is a Markov (birth and death) process. If we know that $C_{LR}(j) = m$, then the $m$ links that cross $j$ will cross $j+1$ independently with probability $1 - 1/(n-j)$. In addition, there will be a contribution due to $T_j$.

In other words, the process $C_{LR}(j)$ is distributed exactly as $Y_j$ in Lemma A.3, with $N = n$ and $\beta = 1$. This means that by Lemma A.3, $\psi_{LR}(j) = \frac{C_{LR}(j)}{n-j} - \frac{j}{n}$ is a martingale. By a standard argument (see for example [50, Section 14.6]), since $\psi_{LR}(j)$ is a martingale, Jensen's inequality implies that $\psi_{LR}(j)^2$ is a submartingale. Doob's submartingale inequality [50, Section 14.6] states that for any non-negative submartingale $V_j$, for any $k$ and $C$:

$$P\left( \sup_{0 \le j \le k} V_j \ge C \right) \le \frac{E V_k}{C}. \tag{17}$$
Since $C_{LR}(j) \sim \mathrm{Bin}(j, (n-j)/n)$, we have $E \psi_{LR}(j)^2 = \mathrm{Var}\, \psi_{LR}(j) = \mathrm{Var}\, C_{LR}(j)/(n-j)^2 = j^2/(n^2(n-j))$, so we know that $E \psi_{LR}(n(1-\alpha))^2 = (1-\alpha)^2/(\alpha n)$.

Hence, taking $V_j = \psi_{LR}(j)^2$, $C = s^2/n$ and $k = n(1-\alpha)$ in Equation (17), the theorem follows. □
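The bound of Theorem A.1 can be compared with simulation. The sketch below makes two assumptions that are not stated in this appendix: that the crossings count is $C_{LR}(j) = \#\{i < j : T_i \ge j\}$ (consistent with the marginal $\mathrm{Bin}(j, (n-j)/n)$ used above), and that $\psi_{LR}(j) = C_{LR}(j)/(n-j) - j/n$. Function names and parameter values are ours.

```python
import math
import random


def lr_crossings(T):
    """C_LR(j) = #{i < j : T_i >= j} for j = 0, ..., n-1 (assumed definition)."""
    n = len(T)
    diff = [0] * (n + 1)
    for i, t in enumerate(T):
        if t > i:
            # The link from i to t crosses every j with i < j <= t.
            diff[i + 1] += 1
            diff[t + 1] -= 1
    counts, running = [], 0
    for j in range(n):
        running += diff[j]
        counts.append(running)
    return counts


def sup_abs_psi(T, alpha):
    """sup over 0 <= j <= (1 - alpha) n of |C_LR(j)/(n - j) - j/n|."""
    n = len(T)
    C = lr_crossings(T)
    j_max = int((1 - alpha) * n)
    return max(abs(C[j] / (n - j) - j / n) for j in range(j_max + 1))


def exceedance_frequency(n, alpha, s, runs, seed=0):
    """Fraction of IID-uniform samples whose sup exceeds s / sqrt(n)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        T = [rng.randrange(n) for _ in range(n)]
        if sup_abs_psi(T, alpha) >= s / math.sqrt(n):
            hits += 1
    return hits / runs


if __name__ == "__main__":
    n, alpha, s = 400, 0.5, 2.0
    print("empirical frequency:", exceedance_frequency(n, alpha, s, runs=200))
    print("bound (1-a)^2/(a s^2):", (1 - alpha) ** 2 / (alpha * s ** 2))
```

Doob's inequality is not expected to be tight, so the empirical frequency will typically sit well below the bound.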



A.2 Matching in a change-point setting

We now use the insights of Appendix A.1 to control the behaviour of the crossings process $\psi_{LR}(j)$ for Graph Model B, where a change-point is present at $n\gamma$. First we use Lemma A.3 to deduce that:

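Proposition A.4. The process $Z_{LR}(j)$ defined by

$$Z_{LR}(j) = \begin{cases} \dfrac{\alpha_L (n-j)}{n\delta_L - j} \left( \psi_{LR}(j) - d_{LR,1}(j) \right) & \text{for } 0 \le j \le n\gamma - 1, \\[2mm] \psi_{LR}(j) - d_{LR,2}(j) & \text{for } n\gamma - 1 \le j \le n-1, \end{cases} \tag{18}$$

is a martingale. Here the mean functions are

$$d_{LR,1}(j) = -\frac{j^2 (1-\gamma)(1-\alpha_L)}{n \delta_L (n-j)}, \tag{19}$$

$$d_{LR,2}(j) = \frac{\gamma \alpha_L}{\delta_L} + \frac{j - n\gamma}{n\delta_R} - \frac{j}{n}. \tag{20}$$

Further $\mathrm{Var}\, Z_{LR}(j)$ equals

$$\frac{\alpha_L^2 j^2}{\delta_L^2 n^2 (n\delta_L - j)} \quad \text{for } 0 \le j \le n\gamma - 1, \tag{21}$$

$$\frac{\alpha_L \gamma (\alpha_L j + \gamma(1-\alpha_L) n)}{\delta_L^2 n (n-j)} + \frac{(j - \gamma n)(j - (1-\alpha_R)\gamma n)}{\delta_R^2 n^2 (n-j)} \quad \text{for } n\gamma - 1 \le j \le n-1. \tag{22}$$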

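Proof. The key is to observe that, under Graph Model B, for $k \le n\gamma - 1$:

$$P(T_k = l \mid T_k \ge l) = \frac{1}{n\delta_L - l} \quad \text{for } l \le n\gamma - 1,$$

and for $k \ge n\gamma$, $P(T_k = l \mid T_k \ge l) = 1/(n-l)$ for $l \ge n\gamma$. This means that

1. For $0 \le j \le n\gamma - 1$, $C_{LR}(j+1) \sim \left( \frac{n\delta_L - j - 1}{n\delta_L - j} \right) \circ C_{LR}(j) + \mathrm{Bern}\left( \frac{n\delta_L - j - 1}{n\delta_L} \right)$.

We deduce that $Z_{LR}(j)$ is a martingale in this range and that $C_{LR}(j) \sim \mathrm{Bin}\left( j, \frac{n\delta_L - j}{n\delta_L} \right)$ by applying Lemma A.3 with $N = n\delta_L$ and $\beta = 1$. We deduce the variance of $Z_{LR}(j)$ since $\mathrm{Var}\, Z_{LR}(j) = \left( \frac{\alpha_L}{n\delta_L - j} \right)^2 \mathrm{Var}\, C_{LR}(j)$.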



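2. For $n\gamma \le j \le n-1$, we divide $C_{LR}(j) = C^{(1)}_{LR}(j) + C^{(2)}_{LR}(j)$, where $C^{(1)}_{LR}(j) = \#\{k < \min(j, n\gamma) : T_k \ge j\}$ and $C^{(2)}_{LR}(j) = \#\{n\gamma \le k < j : T_k \ge j\}$. As before

(a) $C^{(1)}_{LR}(j+1) \sim \left( \frac{n-j-1}{n-j} \right) \circ C^{(1)}_{LR}(j)$. In this case, since

$$E\left[ C^{(1)}_{LR}(j+1) \mid C^{(1)}_{LR}(j) = m \right] = \frac{m(n-j-1)}{n-j},$$

we can divide by $n-j-1$ to deduce that $C^{(1)}_{LR}(j)/(n-j)$ is a martingale. Further, $C^{(1)}_{LR}(j) \sim \mathrm{Bin}\left( n\gamma, \frac{\alpha_L (n-j)}{n\delta_L} \right)$.

(b) $C^{(2)}_{LR}(j+1) \sim \left( \frac{n-j-1}{n-j} \right) \circ C^{(2)}_{LR}(j) + \mathrm{Bern}\left( \frac{n-j-1}{n\delta_R} \right)$. In this case, by considering $Y_s = C^{(2)}_{LR}(n\gamma + s)$ (since if $j = s + n\gamma$ then $n - j = n(1-\gamma) - s$) we can write

$$Y_{s+1} \sim \left( \frac{n(1-\gamma) - s - 1}{n(1-\gamma) - s} \right) \circ Y_s + \mathrm{Bern}\left( \frac{n(1-\gamma) - s - 1}{n\delta_R} \right).$$

This means we can apply Lemma A.3 with $N = n(1-\gamma)$ and $\beta = (1-\gamma)/\delta_R$, to deduce that

$$\frac{Y_s}{n(1-\gamma) - s} - \frac{s}{n\delta_R} = \frac{C^{(2)}_{LR}(j)}{n-j} - \frac{j - n\gamma}{n\delta_R}$$

is a martingale. As before $C^{(2)}_{LR}(j) \sim \mathrm{Bin}\left( j - n\gamma, \frac{n-j}{n\delta_R} \right)$.

The fact that $Z_{LR}(j)$ is a martingale follows since the sum of two independent martingales is a martingale. We deduce the mean and variance of $Z_{LR}(j)$ since

$$\mathrm{Var}(Z_{LR}(j)) = \left( \frac{1}{n-j} \right)^2 \left( \mathrm{Var}\, C^{(1)}_{LR}(j) + \mathrm{Var}\, C^{(2)}_{LR}(j) \right).$$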



□ 



Using this martingale characterization, and Doob's submartingale inequality Equation (17), we can control $Z_{LR}$ uniformly, as before. This allows us to control $\psi_{LR}$, as illustrated in Figure 7. Essentially, the confidence regions for $\psi_{LR}(j)$ are tilted versions of the confidence region of Theorem A.1. This means that the $\psi_{LR}(j)$ stay close to their mean functions for $j \le n(1-\epsilon)$, so that the minimum of $\psi_{LR}(j)$ must be close to the minimum of the mean functions, namely $n\gamma$.

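Remark A.5. By symmetry, the process $Z_{RL}(j)$ defined by

$$Z_{RL}(j) = \begin{cases} \dfrac{1}{\alpha_R} \left( \psi_{RL}(j) - d_{RL,1}(j) \right) & \text{for } 0 \le j \le n\gamma - 1, \qquad (23) \\[2mm] \dfrac{j}{j - n\gamma(1-\alpha_R)} \left( \psi_{RL}(j) - d_{RL,2}(j) \right) & \text{for } n\gamma - 1 \le j \le n-1, \qquad (24) \end{cases}$$

is a time-reversed martingale. Here we write

$$d_{RL,1}(j) = \frac{(1-\gamma)\alpha_R}{\delta_R} + \frac{n\gamma - j}{n\delta_L} - \frac{n-j}{n}, \tag{25}$$

$$d_{RL,2}(j) = -\frac{(n-j)^2 \gamma (1-\alpha_R)}{n j \delta_R}. \tag{26}$$

For $n\gamma \le j \le n-1$ the corresponding $C_{RL}(j) \sim \mathrm{Bin}\left( n-j, \frac{j - n\gamma(1-\alpha_R)}{n\delta_R} \right)$, so that

$$\mathrm{Var}\, Z_{RL}(j) = \frac{1}{(j - n\gamma(1-\alpha_R))^2} \mathrm{Var}\, C_{RL}(j) = \frac{(n-j)^2}{\delta_R^2 n^2 (j - n\gamma(1-\alpha_R))}. \tag{27}$$

Figure 7: Values of (a) $\psi_{LR}(j)$, (b) $\psi_{RL}(j)$ and (c) $\psi(j) = \max(\psi_{LR}(j), \psi_{RL}(j))$. Data is generated under Graph Model B, with a change-point at $n\gamma = 4000$. In this example, $n = 10000$, $\alpha_L = \alpha_R = 0.2$ and $\gamma = 2/5$. The function $\psi_{LR}(j)$ stays close to the mean functions $d_{LR,1}$ and $d_{LR,2}$ except when $j > 0.9n$, as shown in (d).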

Remark A.6. Note that the form of $d_{LR,i}$ and $d_{RL,i}$ helps explain the form of the process seen in the figures. That is, Equations (19) and (20) show that the mean of $\psi_{LR}(j)$ is made up of a concave part left of the change-point and a linear part right of the change-point. Similarly by Equations (25) and (26), the mean of $\psi_{RL}(j)$ will have a linear part left of the change-point and a concave part right of the change-point.

In the figures we see that $\psi(j)$ remains close to the maximum of these two curves; first the concave $d_{LR,1}$ before the change-point, then the linear $d_{LR,2}$, followed by the concave $d_{RL,2}$. The exact values of $\gamma$, $\alpha_L$ and $\alpha_R$ will determine which curve is largest at a particular point.

Notice that the curve $d_{LR}(j)$ made up of $d_{LR,1}(j)$ for $j \le n\gamma - 1$ and $d_{LR,2}(j)$ for $j \ge n\gamma$ is minimised at $j = n\gamma$ with value $d^{\min}_{LR} = d_{LR,1}(n\gamma) = d_{LR,2}(n\gamma) = -\gamma^2(1-\alpha_L)/\delta_L$. Similarly $d_{RL}(j)$ is minimised at $j = n\gamma$ with value $d^{\min}_{RL} = -(1-\gamma)^2(1-\alpha_R)/\delta_R$.



In the proof of Theorem 1.3 we need to distinguish two cases, according to which of $d^{\min}_{LR}$ and $d^{\min}_{RL}$ is smaller. We briefly remark that in the symmetric case $\alpha_L = \alpha_R$, $d^{\min}_{LR} < d^{\min}_{RL}$ if and only if $\gamma > 1/2$. Further, in the limiting case $\alpha_L = \alpha_R = 0$, the two curves $d_{LR}$ and $d_{RL}$ intersect at $j = n/2$.
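The shape described in Remark A.6 can be checked numerically. The sketch below is hedged in several ways: the normalising constants are assumed to be $\delta_L = \gamma + \alpha_L(1-\gamma)$ and $\delta_R = (1-\gamma) + \alpha_R\gamma$ (values inferred from the formulas in this appendix, since the definition of Graph Model B is not repeated here), the mean functions are written in one algebraic form that matches the stated minima, and all function names are ours. The change-point is passed as an integer index `cp` to avoid floating-point comparisons with $n\gamma$.

```python
def deltas(gamma, alpha_l, alpha_r):
    """Assumed normalising constants (delta_L, delta_R) of Graph Model B."""
    return gamma + alpha_l * (1 - gamma), (1 - gamma) + alpha_r * gamma


def d_lr(j, n, cp, alpha_l, alpha_r):
    """Mean of psi_LR: concave up to the change-point cp, linear after it."""
    gamma = cp / n
    dl, dr = deltas(gamma, alpha_l, alpha_r)
    if j <= cp:
        return -j ** 2 * (1 - gamma) * (1 - alpha_l) / (n * dl * (n - j))
    return gamma * alpha_l / dl + (j - cp) / (n * dr) - j / n


def d_rl(j, n, cp, alpha_l, alpha_r):
    """Mean of psi_RL: linear up to the change-point cp, concave after it."""
    gamma = cp / n
    dl, dr = deltas(gamma, alpha_l, alpha_r)
    if j >= cp:
        return -(n - j) ** 2 * gamma * (1 - alpha_r) / (n * dr * j)
    return (1 - gamma) * alpha_r / dr + (cp - j) / (n * dl) - (n - j) / n


if __name__ == "__main__":
    # Parameters of Figure 7: n = 10000, change-point 4000, alpha_L = alpha_R = 0.2.
    n, cp, a = 10000, 4000, 0.2
    grid = range(1, n)
    print("argmin d_LR:", min(grid, key=lambda j: d_lr(j, n, cp, a, a)))
    print("argmin d_RL:", min(grid, key=lambda j: d_rl(j, n, cp, a, a)))
    print("minimum values:", d_lr(cp, n, cp, a, a), d_rl(cp, n, cp, a, a))
```

Under these assumptions both curves attain their minimum at the change-point, with $d^{\min}_{LR} > d^{\min}_{RL}$ for the Figure 7 parameters because $\gamma < 1/2$.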



Proof of Theorem 1.3. Without loss of generality, we will assume that $d^{\min}_{LR} \ge d^{\min}_{RL}$, and



pick e. Further we assume < 0, which is true if < 1. 

First, we observe that the curve $\psi$ cannot be minimised too close to either end of the interval of interest. We write $\epsilon^* = -d^{\min}_{LR} - \epsilon$. Recall that (see Figure 1) $\psi_{LR}(j) \ge -j/n$ and $\psi_{RL}(j) \ge -(n-j)/n$. This means that for $j \le n\epsilon^*$ we know that $\psi_{LR}(j) \ge d^{\min}_{LR} + \epsilon$, and for $j \ge n(1-\epsilon^*)$ we know that $\psi_{RL}(j) \ge d^{\min}_{LR} + \epsilon$.

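This means that we can use the union bound and standard conditioning arguments to decompose the error probability into three terms:

$$\begin{aligned} P\left( \left| \frac{1}{n} \arg\min_j \psi(j) - \gamma \right| \ge \frac{s}{\sqrt{n}} \right) &\le P\left( \psi(n\gamma) > d^{\min}_{LR} + \epsilon \right) + P\left( \min_{|j - n\gamma| \ge s\sqrt{n}} \psi(j) \le \psi(n\gamma) \le d^{\min}_{LR} + \epsilon \right) \\ &\le P\left( \psi(n\gamma) > d^{\min}_{LR} + \epsilon \right) && (28) \\ &\quad + P\left( \min_{n\epsilon^* \le j \le n\gamma - s\sqrt{n}} \psi_{LR}(j) \le d^{\min}_{LR} + \epsilon \right) && (29) \\ &\quad + P\left( \min_{n\gamma + s\sqrt{n} \le j \le n(1-\epsilon^*)} \psi_{LR}(j) \le d^{\min}_{LR} + \epsilon \right), && (30) \end{aligned}$$

using the fact that $\psi(j) = \max(\psi_{LR}(j), \psi_{RL}(j))$. We can bound each of these terms in order.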



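1. Observe that by the union bound and the form of the mean functions in Equations (20) and (26), we can bound (28) by

$$\begin{aligned} P\left( \psi(n\gamma) > d^{\min}_{LR} + \epsilon \right) &\le P\left( \psi_{LR}(n\gamma) > d^{\min}_{LR} + \epsilon \right) + P\left( \psi_{RL}(n\gamma) > d^{\min}_{RL} + \epsilon \right) \\ &\le P\left( Z_{LR}(n\gamma) > \epsilon \right) + P\left( \alpha_R Z_{RL}(n\gamma) > \epsilon \right) \\ &\le \frac{\gamma^2 \alpha_L}{\delta_L^2 (1-\gamma) n \epsilon^2} + \frac{(1-\gamma)^2 \alpha_R}{\delta_R^2 \gamma n \epsilon^2} \\ &\le \frac{1}{n\epsilon^2} \left( \frac{\alpha_L}{1-\gamma} + \frac{\alpha_R}{\gamma} \right), \end{aligned} \tag{31}$$

since by Equation (21) $\mathrm{Var}(Z_{LR}(n\gamma)) = \frac{\gamma^2 \alpha_L}{\delta_L^2 (1-\gamma) n}$ and by Equation (27) $\mathrm{Var}(Z_{RL}(n\gamma)) = \frac{(1-\gamma)^2}{\alpha_R \delta_R^2 \gamma n}$.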



2. To bound (29), the key is to observe that the mean term $d_{LR,1}$ defined in Equation (19) is a concave function. This means that for $t \ge 0$ we know that

$$d_{LR,1}(n\gamma - t) - d^{\min}_{LR} \ge -\frac{t\, d_{LR,1}(n\gamma)}{n\gamma} = \frac{t\gamma(1-\alpha_L)}{n\delta_L}. \tag{32}$$



As defined in Proposition A. 4, iPlrU) — df^^ is a multiple of ZLfi{j) with a 
coefficient which decreases in j, so for ne* < j < ri'j, we can bound it by 
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— — > > a^. This means that by Equations (|19|) and 

7(1-7) n-j 



(32) 



P ( min ^IjlrU) < dTn + e 



< P I {dLR,iin^ - s^) - df^) - 6l sup \ZLRij)\ 

OL\/n 7(1-7) V0<i<n7 



< e 



< 



2 

7aL + \ Var {ZLR{n-f)) 



7(1-7) / ( sj{l-aL) 



aiil - 7)^ (■57(1 - "l) - e^Lv^ 



2 ' 



by Doob's inequahty (17) and the variance expression (21) 



3. Similarly, using Equation (20), we know that

$$d_{LR,2}(n\gamma + t) - d^{\min}_{LR} = \frac{t\gamma(1-\alpha_R)}{n\delta_R}, \tag{34}$$

meaning that



P( min Mj)<dt'^ + e 

^ra7+sy'n<j<n(l— £*) 



(33) 



(34) 



Var {ZLR{n{l - e*))) 



sup IZlrU)] 1 < e 

ra7+sy'n<j<n(l — e* ) 



< 



S7(l-afl) 



— e 



1 + OlL 



2 ' 



(35) 



since ( 22 ) implies that 



Var {ZLR{n{l - e*))) 

^ /ai.7K(l-e*)+7(l-«L)) , (1 - e* - 7)(1 " e* - 7(1 " 



< ^(^ + 1 



ne* V 5 



SI 

7 + «L 
ne*5r 



SI 
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The result follows on adding together the contributions from Equations (31), (33) and 
(35). We can choose for example e = 7'^(1 — O-lIS / {Siy/n), since s/y/n < (1 — 7), since 



the assumption that rf™]^ > (P^^ ensures that e < 7(1 — 7)^(1 — aji)s / dny/n. Putting 
these terms together, we deduce that we can take 

(«i + 7'(l-ai)(l-7))' 



K 



+ 



51 



+ 



7 1-7776(1-0^)2 aL{l-'^'^Y{l-lf{l- aif 
(7 + "l) 



4 



7^(1 - aL){l - 7(1 - 7)) (7^(1 - - (1 - iff 



(36) 



□ 



Remark A.7. Note that the form of (36) suggests that as $\alpha_L$ tends to zero, then $K$ will tend to infinity, meaning that this is the hardest case. Of course, the case $\alpha_L = \alpha_R = 0$ will have no crossings of $n\gamma$, so should be the easiest case. We can indeed do much better by adapting the argument slightly. Without loss of generality assume that $\gamma \le 1/2$, and recall that in this case $d^{\min}_{LR} = -\gamma$, and we can choose $\epsilon = 0$, so that $\epsilon^* = \gamma$. This means that Equations (28) and (29) are zero, since $\mathrm{Var}\, Z_{LR}(n\gamma) = 0$, and since the interval $[n\epsilon^*, n\gamma - s\sqrt{n}]$ is empty. Then taking $\alpha_L = \alpha_R = 0$ in Equation (35), where $\mathrm{Var}\, Z_{LR}(n(1-\gamma)) = \frac{(1-2\gamma)^2}{n\gamma(1-\gamma)^2}$, bounds the remaining term (30). Overall, this means that

$$P\left( |\hat{\gamma} - \gamma| \ge \frac{s}{\sqrt{n}} \right) \le \frac{(1-2\gamma)^2}{\gamma^3 s^2},$$

suggesting that the estimator is $\sqrt{n}$-consistent in this case.

In fact, we can do better. Since the interval $[n\epsilon^*, n\gamma - 1]$ is empty, we can strengthen the bound on (29) to deduce that $P\left( \min_{n\epsilon^* \le j \le n\gamma - 1} \psi_{LR}(j) \le d^{\min}_{LR} \right) = 0$. Further notice that when $\gamma = 1/2$, $P(\hat{\gamma} \ne \gamma) = 0$, since the interval $[n\gamma + 1, n(1-\epsilon^*)]$ is again empty.

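Otherwise, we divide the interval into further subintervals, using a similar argument to that used to obtain (35). Since Equation (22) gives $\mathrm{Var}\, Z_{LR}(b) = \frac{(b - \gamma n)^2}{(1-\gamma)^2 n^2 (n-b)}$, for any $n\gamma < a < b$ we know that

$$\begin{aligned} P\left( \min_{a \le j \le b} \psi_{LR}(j) \le d^{\min}_{LR} \right) &\le P\left( d_{LR,2}(a) - d^{\min}_{LR} \le \sup_{a \le j \le b} |Z_{LR}(j)| \right) \\ &= P\left( \frac{\gamma(a - \gamma n)}{n(1-\gamma)} \le \sup_{a \le j \le b} |Z_{LR}(j)| \right) \\ &\le \frac{n^2(1-\gamma)^2}{\gamma^2 (a - \gamma n)^2} \mathrm{Var}(Z_{LR}(b)) = \frac{(b - \gamma n)^2}{(n-b)\gamma^2 (a - \gamma n)^2}. \end{aligned} \tag{37}$$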
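This means that we can pick a constant $C > 1$, and divide the interval $[n\gamma + 1, n(1-\gamma)]$ into subintervals $[a_k, b_k]$, where $a_k = n\gamma + C^k$ and $b_k = \min(n\gamma + C^{k+1}, n(1-\gamma))$, where $k = 0, \ldots, K-1$, with $K = \log(n(1-2\gamma))/\log C$. Each subinterval contributes at most $C^2/(n\gamma^3)$ in Equation (37), because $b_k - \gamma n \le C(a_k - \gamma n)$ and $n - b_k \ge n\gamma$. Applying the union bound to these intervals, we deduce by Equation (37) that

$$P(\hat{\gamma} \ne \gamma) \le \frac{C^2}{\gamma^3} \frac{K}{n}, \tag{38}$$

or in other words that the probability that the estimator makes a mistake is $O((\log n)/n)$. Up to the factor of $\log n$, this probability is of optimal order, since for $\gamma < 1/2$ independence implies that

$$\begin{aligned} \liminf_{n \to \infty} n P(\hat{\gamma} \ne \gamma) &\ge \liminf_{n \to \infty} n P\left( \{\psi_{LR}(n\gamma + 1) < d^{\min}_{LR}\} \cap \{\psi_{RL}(n\gamma + 1) < d^{\min}_{LR}\} \right) \\ &\ge \liminf_{n \to \infty} n P(C_{LR}(n\gamma + 1) = 0)\, P(C_{RL}(n\gamma + 1) = 0) \\ &= \frac{e^{-1}}{1-\gamma}, \end{aligned}$$

as $C_{LR}(n\gamma + 1) \sim \mathrm{Bern}\left( \frac{n(1-\gamma) - 1}{n(1-\gamma)} \right)$ and $C_{RL}(n\gamma + 1) \sim \mathrm{Bin}\left( n(1-\gamma) - 1, \frac{1}{n(1-\gamma)} \right) \to \mathrm{Po}(1)$.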
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